Multivariate Pólya distribution

The multivariate Pólya distribution, named after George Pólya and also called the Dirichlet compound multinomial distribution, is a compound probability distribution: a probability vector p is drawn from a Dirichlet distribution with parameter vector \alpha, and a set of discrete samples is then drawn from the categorical distribution with probability vector p. The compounding corresponds to a Pólya urn scheme. In document classification, for example, the distribution is used to model word counts for different document types.
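The two equivalent sampling views described above can be sketched in Python using only the standard library. This is an illustrative sketch, not a reference implementation: the function names `sample_compound` and `sample_urn` are hypothetical, and the Dirichlet draw relies on the standard construction of normalizing independent gamma variates.

```python
import random

def sample_compound(alpha, n_draws, rng):
    """Two-stage draw: p ~ Dirichlet(alpha), then n_draws categorical draws from p."""
    # A Dirichlet draw is obtained by normalizing independent Gamma(alpha_k, 1) variates.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    p = [g / total for g in gammas]
    counts = [0] * len(alpha)
    for k in rng.choices(range(len(alpha)), weights=p, k=n_draws):
        counts[k] += 1
    return counts

def sample_urn(alpha, n_draws, rng):
    """Pólya urn: each draw picks category k with weight alpha_k + (current count of k)."""
    counts = [0] * len(alpha)
    for _ in range(n_draws):
        weights = [a + c for a, c in zip(alpha, counts)]
        k = rng.choices(range(len(alpha)), weights=weights)[0]
        counts[k] += 1  # reinforcement: a drawn category becomes more likely
    return counts

rng = random.Random(0)          # seeded generator, chosen arbitrarily
alpha = [0.5, 1.0, 2.0]         # example parameter vector (assumed values)
print(sample_compound(alpha, 10, rng))
print(sample_urn(alpha, 10, rng))
```

Both functions return a count vector of length K whose entries sum to the number of draws; the two procedures yield the same distribution over count vectors.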

Probability mass function

Consider N independent draws from a categorical distribution with K categories. Let x=(n_1,n_2,...,n_K) denote the vector of counts, where n_k is the number of times category k was drawn. If the parameter of the categorical distribution is given as p=(p_1,p_2,...,p_K), where p_k is the probability of drawing category k, then the probability distribution of the counts, P(x|p), is the multinomial distribution with parameter p. Here, however, p is not given; it is instead treated as a draw from a Dirichlet distribution with parameter vector \boldsymbol\alpha=(\alpha_1,\alpha_2,\ldots,\alpha_K). The resulting compound distribution is obtained by integrating out p:

\Pr(\mathbf{x}\mid\boldsymbol{\alpha})=\int_{\mathbf{p}}\Pr(\mathbf{x}\mid \mathbf{p})\Pr(\mathbf{p}\mid\boldsymbol{\alpha})\textrm{d}\mathbf{p}

which results in the following explicit formula:

\Pr(\mathbf{x}\mid\boldsymbol{\alpha})=\frac{N!}{\prod_{k}\left(n_{k}!\right)}\frac{\Gamma\left(A\right)}{\Gamma\left(N+A\right)}\prod_{k}\frac{\Gamma(n_{k}+\alpha_{k})}{\Gamma(\alpha_{k})}

where \Gamma is the gamma function, with

A=\sum_k \alpha_k \,\text{and}\; N=\sum_k n_k.
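The explicit formula can be evaluated numerically as follows. This is a minimal sketch: the function name `polya_log_pmf` is hypothetical, and working in log space with `math.lgamma` is a design choice made here to avoid overflow of the gamma function for large counts, not something prescribed by the formula itself.

```python
from math import lgamma, exp

def polya_log_pmf(counts, alpha):
    """Log of Pr(x | alpha) for count vector `counts` and parameters `alpha` (all alpha_k > 0)."""
    n_total = sum(counts)   # N
    a_total = sum(alpha)    # A
    # log( N! / prod_k n_k! ), using Gamma(n + 1) = n!
    log_p = lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)
    # log( Gamma(A) / Gamma(N + A) )
    log_p += lgamma(a_total) - lgamma(n_total + a_total)
    # log( prod_k Gamma(n_k + alpha_k) / Gamma(alpha_k) )
    log_p += sum(lgamma(n + a) - lgamma(a) for n, a in zip(counts, alpha))
    return log_p

# With alpha = (1, 1) and N = 2, the three outcomes (2,0), (1,1), (0,2) are equally likely.
print(exp(polya_log_pmf([1, 1], [1.0, 1.0])))  # approximately 1/3
```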

Another form

The probability mass function may be written more compactly in terms of the beta function, as follows:

\Pr(\mathbf{x}\mid\boldsymbol{\alpha})=\frac{N B\left(A,N\right)}
{\prod_{k:n_k>0} n_k B\left(\alpha_k,n_k \right)}

where B is the beta function and the product runs only over categories with nonzero counts. This form assumes N > 0.
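The beta-function form can be sketched in the same way. The names `log_beta` and `polya_pmf_beta_form` are hypothetical, and returning 1 for an empty sample (N = 0) is a convention assumed here, since B(A, 0) is undefined.

```python
from math import lgamma, log, exp

def log_beta(a, b):
    """log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def polya_pmf_beta_form(counts, alpha):
    """Pr(x | alpha) computed as N B(A, N) / prod_{k: n_k > 0} n_k B(alpha_k, n_k)."""
    n_total = sum(counts)
    if n_total == 0:
        return 1.0  # assumed convention: the empty sample has probability 1
    a_total = sum(alpha)
    log_p = log(n_total) + log_beta(a_total, n_total)
    for n, a in zip(counts, alpha):
        if n > 0:  # categories with zero count contribute a factor of 1
            log_p -= log(n) + log_beta(a, n)
    return exp(log_p)

print(polya_pmf_beta_form([1, 1], [1.0, 1.0]))  # approximately 1/3
```

Skipping the zero-count categories is what makes this form cheaper than the gamma-function form when the count vector is sparse, as is typical for word counts.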

Related distributions

The two-category case (K = 2) of the multivariate Pólya distribution, in which a single count determines the outcome, is known as the beta-binomial distribution.
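The reduction can be checked directly from the probability mass function. As a worked step, set K = 2 and write n_1 = n and n_2 = N - n (the single free count n is notation introduced here for illustration), so that A = \alpha_1 + \alpha_2:

```latex
\Pr(n \mid \alpha_1, \alpha_2)
  = \frac{N!}{n!\,(N-n)!}\,
    \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(N+\alpha_1+\alpha_2)}\,
    \frac{\Gamma(n+\alpha_1)\,\Gamma(N-n+\alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)}
  = \binom{N}{n}\,\frac{B(n+\alpha_1,\; N-n+\alpha_2)}{B(\alpha_1,\alpha_2)}
```

The last expression is the usual beta-binomial probability mass function with shape parameters \alpha_1 and \alpha_2.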

Uses

The multivariate Pólya distribution is used in automated document classification and clustering, genetics, economics, combat modeling, and quantitative marketing.
